Clustering High Dimension, Low Sample Size Data Using the Maximal Data Piling Distance

Authors

  • Jeongyoun Ahn
  • Myung Hee Lee
  • Young Joo Yoon
Abstract

We propose a new hierarchical clustering method for high dimension, low sample size (HDLSS) data. The method exploits the fact that each individual data vector accounts for exactly one dimension in the subspace generated by HDLSS data. The linkage used to measure the distance between clusters is the orthogonal distance between the affine subspaces generated by each cluster. The ideal implementation would consider all possible binary splits of the data and choose the one that maximizes the distance between the two resulting groups. Since this is not computationally feasible in general, we use the singular value decomposition to approximate it. We provide theoretical justification of the method by studying high dimensional asymptotics. We also obtain the probability distribution of the distance measure under the null hypothesis of no split, which we use to propose a criterion for determining the number of clusters. Simulation and data analysis with microarray data show competitive clustering performance of the proposed method.
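
As a rough illustration of the linkage described in the abstract, the following Python sketch computes the orthogonal distance between the affine subspaces generated by two clusters, using a least-squares projection. The function names `affine_subspace_distance` and `svd_split` are hypothetical and do not come from the paper. In particular, `svd_split` is only an assumed stand-in (a sign split on the first singular vector of the centered data); the abstract does not spell out how the authors use the singular value decomposition, so this should not be read as their procedure.

```python
import numpy as np


def affine_subspace_distance(A, B):
    """Orthogonal distance between the affine hulls of the rows of A and B.

    A has shape (n1, d) and B has shape (n2, d); rows are observations.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    a0, b0 = A[0], B[0]
    # Direction vectors spanning each affine subspace, stacked as columns.
    D = np.vstack([A[1:] - a0, B[1:] - b0]).T  # shape (d, n1 + n2 - 2)
    diff = a0 - b0
    if D.shape[1] == 0:
        # Both clusters are single points: plain Euclidean distance.
        return float(np.linalg.norm(diff))
    # Remove from diff everything explained by the combined directions;
    # the norm of what remains is the distance between the subspaces.
    coef, *_ = np.linalg.lstsq(D, diff, rcond=None)
    return float(np.linalg.norm(diff - D @ coef))


def svd_split(X):
    """Assumed heuristic (not taken from the paper): split the rows of X
    by the sign of their coordinate on the first left singular vector
    of the column-centered data matrix."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, 0] >= 0


if __name__ == "__main__":
    # Two skew lines in 3-D: the x-axis, and a line parallel to the
    # y-axis at height z = 1.
    line_a = [[0, 0, 0], [1, 0, 0]]
    line_b = [[0, 0, 1], [0, 1, 1]]
    print(affine_subspace_distance(line_a, line_b))
```

Note that when the combined direction vectors already span the whole space (which can happen when the dimension is small relative to the sample size), the two affine subspaces intersect and the distance is zero; the measure is informative in the HDLSS regime the abstract targets.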

Similar resources

Clustering for high-dimension, low-sample size data using distance vectors

In high-dimension, low-sample size (HDLSS) data, it is not always true that closeness of two objects reflects a hidden cluster structure. We point out the important fact that it is not the closeness, but the “values” of distance that contain information of the cluster structure in high-dimensional space. Based on this fact, we propose an efficient and simple clustering approach, called distance ...

Distance Weighted Discrimination

High Dimension Low Sample Size statistical analysis is becoming increasingly important in a wide range of applied contexts. In such situations, it is seen that the popular Support Vector Machine suffers from “data piling” at the margin, which can diminish generalizability. This leads naturally to the development of Distance Weighted Discrimination, which is based on Second Order Cone Programmin...

Distance Weighted Discrimination

High Dimension Low Sample Size statistical analysis is becoming increasingly important in a wide range of applied contexts. In such situations, it is seen that the appealing discrimination method called the Support Vector Machine can be improved. The revealing concept is “data piling” at the margin. This leads naturally to the development of “Distance Weighted Discrimination,” which also is bas...

Effective Linear Discriminant Analysis for High Dimensional, Low Sample Size Data

In the so-called high dimensional, low sample size (HDLSS) settings, LDA possesses the “data piling” property, that is, it maps all points from the same class in the training data to a common point, and so when viewed along the LDA projection directions, the data are piled up. Data piling indicates overfitting and usually results in poor out-of-sample classification. In this paper, a novel appr...

Sparse Linear Discriminant Analysis with Applications to High Dimensional Low Sample Size Data

This paper develops a method for automatically incorporating variable selection in Fisher’s linear discriminant analysis (LDA). Utilizing the connection of Fisher’s LDA and a generalized eigenvalue problem, our approach applies the method of regularization to obtain sparse linear discriminant vectors, where “sparse” means that the discriminant vectors have only a small number of nonzero compone...

Publication date: 2012